Abstract: Renyi divergence is related to Renyi entropy much
like information divergence (also called Kullback-Leibler divergence
or relative entropy) is related to Shannon's entropy, and
comes up in many settings. It was introduced by Renyi as a
measure of information that satisfies almost the same axioms as
information divergence.

We review the most important properties of Renyi divergence, 
including its relation to some other distances. We show how 
Renyi divergence appears when the theory of majorization is 
generalized from the finite to the continuous setting. Finally, 
Renyi divergence plays a role in analyzing the number of binary 
questions required to guess the values of a sequence of random 
variables. 



I. Introduction 

Since Shannon's introduction of his entropy function various 
other similar measures of uncertainty or information have been 
introduced. Most of these have found no applications and 
some have found applications only in quite special cases. An 
exception is formed by Renyi entropy and Renyi divergence, 
which pop up again and again. They are far from being as well 
understood as Shannon entropy and Shannon divergence, and 
do not have as simple an interpretation. Erdal Arikan observed 
that the discrete version of Renyi entropy is related to so-called 
guessing moments [1].

In this short note we shall first review the most important
properties of Renyi divergence in Section II. In Section III
we give a very brief introduction to Markov ordering and its
relation to majorization. Then in Sections IV and V we relate
Renyi divergence to the theory of majorization. Finally,
in Section VI we show that, like its entropy counterpart,
Renyi divergence is related to guessing moments.

II. Renyi Divergence 

Let P and Q be probability measures on a measurable space
$(X, \mathcal{F})$, and let p and q be their densities with respect to a
common $\sigma$-finite dominating measure $\mu$. Then for any
$0 < \alpha < \infty$ except $\alpha = 1$, the Renyi divergence of order
$\alpha$ of P from Q is defined as

$$D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log \int p^\alpha q^{1-\alpha} \, d\mu, \qquad (1)$$

with the conventions that $p^\alpha q^{1-\alpha} = 0$ if $p = q = 0$, even for
$\alpha < 0$ and $\alpha > 1$, and that $x/0 = \infty$ for $x > 0$. Continuity
considerations lead to the following extensions for $\alpha \in \{0, 1\}$:

$$D_0(P\|Q) = \lim_{\alpha \downarrow 0} D_\alpha(P\|Q) = -\log Q(p > 0),$$

$$D_1(P\|Q) = \lim_{\alpha \uparrow 1} D_\alpha(P\|Q) = D(P\|Q),$$

where $D(P\|Q) = \int p \log (p/q) \, d\mu$ (with the conventions that
$0 \log (0/x) = 0$ and $x \log (x/0) = \infty$ if $x > 0$) denotes the
information divergence, which is also known as Kullback-Leibler
divergence or relative entropy. For $\alpha > 0$, it was
introduced by Renyi [2], who provided an axiomatic characterization
in terms of "intuitively evident postulates". An
operational characterization of Renyi divergence via coding
has been described in [3].
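
For distributions on a finite set, definition (1) and its extensions to the orders 0 and 1 can be evaluated directly. The following Python sketch is only an illustration: it assumes that P and Q are given as lists of point probabilities on the same finite set (so that $\mu$ is the counting measure), it uses natural logarithms, and the function name `renyi_divergence` is ours.

```python
import math

def renyi_divergence(p, q, alpha):
    """D_alpha(P||Q) in nats for point probabilities p, q on a finite set, 0 <= alpha < inf."""
    if alpha == 0:
        # D_0(P||Q) = -log Q(p > 0)
        mass = sum(qi for pi, qi in zip(p, q) if pi > 0)
        return math.inf if mass == 0 else -math.log(mass)
    if alpha == 1:
        # D_1 is the information divergence, with 0 log(0/x) = 0 and x log(x/0) = inf
        if any(pi > 0 and qi == 0 for pi, qi in zip(p, q)):
            return math.inf
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    if alpha > 1 and any(pi > 0 and qi == 0 for pi, qi in zip(p, q)):
        return math.inf  # convention x/0 = inf for x > 0
    # In the remaining cases, points with p = 0 or q = 0 contribute 0 to the sum.
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q) if pi > 0 and qi > 0)
    return math.inf if s == 0 else math.log(s) / (alpha - 1)

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
values = [renyi_divergence(p, q, a) for a in (0, 0.5, 1, 2)]
```

In this example q/p is constant on the support of P, so all four orders give the same value, log 2.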

We will first review some of the basic properties of $D_\alpha$.
Whenever these properties can easily be derived from known
results, we will point to the relevant literature. For other
properties, space requirements limit us to only hint at their
proofs. A longer version of this paper with full proofs will
be published elsewhere, and will include results for negative
values of the order $\alpha$.

Let us start by noting that, for finite orders $0 < \alpha \neq 1$,
$D_\alpha$ is a continuous, strictly increasing function of the power
divergence

$$d_\alpha(P, Q) = \frac{\int p^\alpha q^{1-\alpha} \, d\mu - 1}{\alpha - 1}.$$

As the $d_\alpha$ are f-divergences, we may derive properties for $D_\alpha$
from general properties of f-divergences [4].
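
The relation between the two quantities can be made explicit: solving the definition of the power divergence for the integral gives $D_\alpha = \frac{1}{\alpha-1}\log(1 + (\alpha-1)\,d_\alpha)$. The sketch below checks this numerically; it assumes strictly positive point probabilities on a finite set, and the helper names are ours.

```python
import math

def alpha_integral(p, q, alpha):
    """Sum of p^alpha q^(1-alpha); assumes strictly positive point probabilities."""
    return sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))

def power_divergence(p, q, alpha):
    return (alpha_integral(p, q, alpha) - 1) / (alpha - 1)

def renyi_from_power(d, alpha):
    # D_alpha = log(1 + (alpha - 1) d_alpha) / (alpha - 1)
    return math.log(1 + (alpha - 1) * d) / (alpha - 1)

p, q = [0.7, 0.2, 0.1], [0.3, 0.3, 0.4]
alpha = 0.5
d = power_divergence(p, q, alpha)
D = math.log(alpha_integral(p, q, alpha)) / (alpha - 1)  # definition (1)
```

For $\alpha < 1$ the map is the composition of two decreasing functions and for $\alpha > 1$ of two increasing ones, so it is increasing in both cases.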

In particular, Renyi divergence satisfies the data processing
inequality

$$D_\alpha(P_{|\mathcal{G}} \| Q_{|\mathcal{G}}) \le D_\alpha(P\|Q)$$

for any $\sigma$-subalgebra $\mathcal{G} \subseteq \mathcal{F}$, where $P_{|\mathcal{G}}$ and
$Q_{|\mathcal{G}}$ denote the restrictions of P and Q to $\mathcal{G}$. As a special
case, taking $\mathcal{G} = \{\emptyset, X\}$ to be the trivial algebra, we find that

$$D_\alpha(P\|Q) \ge 0.$$

$D_\alpha(P\|Q) = 0$ if and only if $P = Q$. Taking $\mathcal{G} = \sigma(\mathcal{P})$
to be the $\sigma$-algebra generated by a finite partition $\mathcal{P}$ of X,
the data processing inequality implies that discretizing X
can only decrease $D_\alpha$. However, because of the following
property, which carries over from f-divergences, $D_\alpha$ may be
approximated arbitrarily well by such finite partitions:

$$D_\alpha(P\|Q) = \sup_{\mathcal{P}} D_\alpha(P_{|\sigma(\mathcal{P})} \| Q_{|\sigma(\mathcal{P})}) \quad (\alpha > 0), \qquad (2)$$

where the supremum is over all finite partitions $\mathcal{P}$ of X.
This characterization also shows that we have found the right
generalization of Renyi's definition for finite X.
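
On a finite set, restricting to the $\sigma$-algebra generated by a partition simply means summing the point probabilities within each cell. The following sketch, with hypothetical helper names and strictly positive example distributions, illustrates that such coarsening cannot increase the divergence and that the trivial partition gives zero.

```python
import math

def renyi(p, q, alpha):
    """D_alpha for strictly positive point probabilities, alpha > 0, alpha != 1."""
    return math.log(sum(a**alpha * b**(1 - alpha) for a, b in zip(p, q))) / (alpha - 1)

def restrict(p, partition):
    """Restriction to the sigma-algebra generated by a partition: one mass per cell."""
    return [sum(p[i] for i in cell) for cell in partition]

p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.25, 0.25]
partition = [[0, 1], [2, 3]]
trivial = [[0, 1, 2, 3]]

fine = {a: renyi(p, q, a) for a in (0.5, 2.0, 5.0)}
coarse = {a: renyi(restrict(p, partition), restrict(q, partition), a) for a in fine}
zero = renyi(restrict(p, trivial), restrict(q, trivial), 2.0)
```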

Using the dominated convergence theorem it can be shown
that:

Theorem 1: $D_\alpha$ is continuous in $\alpha$ on

$$A = \{\alpha \mid 0 \le \alpha \le 1 \text{ or } D_\alpha(P\|Q) < \infty\}.$$

$D_\alpha$ is also nondecreasing in $\alpha$, and on A it is constant if and
only if q/p is constant P-a.s.
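
A quick numerical illustration of the monotonicity in the order, for one arbitrarily chosen pair of strictly positive distributions on three points (the grid of orders and the helper name are ours):

```python
import math

def renyi(p, q, alpha):
    """D_alpha for strictly positive point probabilities, alpha > 0, alpha != 1."""
    return math.log(sum(a**alpha * b**(1 - alpha) for a, b in zip(p, q))) / (alpha - 1)

p, q = [0.7, 0.2, 0.1], [0.3, 0.3, 0.4]
orders = [0.1, 0.5, 0.9, 1.5, 2.0, 10.0]
values = [renyi(p, q, a) for a in orders]
is_nondecreasing = all(x <= y + 1e-12 for x, y in zip(values, values[1:]))

# If q/p is constant (here P = Q), the divergence is the same for every order.
u = [0.5, 0.5]
flat = [renyi(u, u, a) for a in orders]
```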

The fact that $D_\alpha$ is nondecreasing, together with Equation
(2), implies that $\lim_{\alpha \uparrow 1} D_\alpha = D$, as asserted in our
definition of $D_1$: for finite X, this can be verified directly using
l'Hopital's rule. Therefore

$$\lim_{\alpha \uparrow 1} D_\alpha(P\|Q) = \sup_{\alpha < 1} \sup_{\mathcal{P}} D_\alpha(P_{|\sigma(\mathcal{P})} \| Q_{|\sigma(\mathcal{P})}) = \sup_{\mathcal{P}} \sup_{\alpha < 1} D_\alpha(P_{|\sigma(\mathcal{P})} \| Q_{|\sigma(\mathcal{P})}) = D(P\|Q). \qquad (3)$$

The assertion that $\lim_{\alpha \downarrow 0} D_\alpha = -\log Q(p > 0)$ is verified
differently, using the dominated convergence theorem and the
observation that $\lim_{\alpha \downarrow 0} p^\alpha q^{1-\alpha}$ equals q if $p > 0$ and 0
otherwise. Renyi divergence may be extended to $\alpha = \infty$ by
letting $\alpha$ tend to $\infty$. Then, for finite X,

$$D_\infty(P\|Q) = \log \max_x \frac{P(x)}{Q(x)},$$

and by an interchanging of suprema similar to (3) we find that

$$D_\infty(P\|Q) = \log \sup_{A \in \mathcal{F}} \frac{P(A)}{Q(A)} = \log \operatorname{ess\,sup} \frac{dP}{dQ}(x)$$

in general. Consequently, $D_\infty(Q\|P)$ (note the reversal of P
and Q) is a one-to-one function of the separation distance
$s(P, Q) = \max_x (1 - P(x)/Q(x))$, defined only for countable
X, which has been used to obtain bounds on the rate of
convergence to the stationary distribution for certain Markov
chains [5], [6].
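
The one-to-one correspondence can be written out: since $\max_x (1 - P(x)/Q(x)) = 1 - 1/\max_x (Q(x)/P(x))$, we have $s(P,Q) = 1 - \exp(-D_\infty(Q\|P))$. A minimal numerical check, assuming strictly positive point probabilities on a finite set and using helper names of our own:

```python
import math

def renyi_inf(p, q):
    """D_inf(P||Q) = log max_x P(x)/Q(x) on a finite set with Q(x) > 0."""
    return math.log(max(a / b for a, b in zip(p, q)))

def separation(p, q):
    """s(P, Q) = max_x (1 - P(x)/Q(x))."""
    return max(1 - a / b for a, b in zip(p, q))

p, q = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
s = separation(p, q)
d = renyi_inf(q, p)  # note the reversed arguments
```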

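Equation (2) implies that there exists a sequence $\mathcal{F}_1, \mathcal{F}_2, \ldots$
of $\sigma$-algebras generated by finite partitions such that

$$\lim_{n \to \infty} D_\alpha(P_{|\mathcal{F}_n} \| Q_{|\mathcal{F}_n}) = D_\alpha(P\|Q).$$

By the connection to f-divergences, such a convergence result
holds for any increasing sequence of $\sigma$-algebras
$\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}_\infty = \sigma(\bigcup_{n=1}^{\infty} \mathcal{F}_n) \subseteq \mathcal{F}$:

$$\lim_{n \to \infty} D_\alpha(P_{|\mathcal{F}_n} \| Q_{|\mathcal{F}_n}) = D_\alpha(P_{|\mathcal{F}_\infty} \| Q_{|\mathcal{F}_\infty}) \quad (\alpha > 0) \qquad (4)$$

[4, Theorem 15]. By a suitable choice of $\mathcal{F}_n$ this result extends
additivity for any distributions $P_1, P_2, \ldots$ and $Q_1, Q_2, \ldots$,

$$\sum_{n=1}^{N} D_\alpha(P_n \| Q_n) = D_\alpha(P_1 \times \cdots \times P_N \| Q_1 \times \cdots \times Q_N),$$

from any finite N (for which it is easy to prove) to $N = \infty$
(if $\alpha > 0$). For $\alpha = 0$ additivity only holds for finite N. By
a direct proof we can also prove the counterpart to (4) for
decreasing sequences of $\sigma$-algebras
$\mathcal{F} \supseteq \mathcal{F}_1 \supseteq \mathcal{F}_2 \supseteq \cdots \supseteq \mathcal{F}_\infty = \bigcap_{n=1}^{\infty} \mathcal{F}_n$
(for finite $\alpha$) under the condition that the
divergence is finite.

Fig. 1. Example of a Lorenz diagram.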
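
For finite N, additivity follows because the integral in the definition factorizes over product measures. The sketch below verifies this for N = 2 on small finite sets; the example distributions and helper names are ours, and strictly positive point probabilities are assumed.

```python
import math
from itertools import product

def renyi(p, q, alpha):
    """D_alpha for strictly positive point probabilities, alpha > 0, alpha != 1."""
    return math.log(sum(a**alpha * b**(1 - alpha) for a, b in zip(p, q))) / (alpha - 1)

def product_distribution(*marginals):
    """Point probabilities of the product measure, in lexicographic order."""
    return [math.prod(t) for t in product(*marginals)]

P1, Q1 = [0.7, 0.3], [0.5, 0.5]
P2, Q2 = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]
alpha = 2.0
lhs = renyi(P1, Q1, alpha) + renyi(P2, Q2, alpha)
rhs = renyi(product_distribution(P1, P2), product_distribution(Q1, Q2), alpha)
```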
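Let

$$H^2(P, Q) = \int (p^{1/2} - q^{1/2})^2 \, d\mu = 2 - 2 \int p^{1/2} q^{1/2} \, d\mu = d_{1/2}(P, Q)$$

denote the squared Hellinger distance, and let

$$\chi^2(P, Q) = \int \frac{(p - q)^2}{q} \, d\mu = d_2(P, Q)$$

denote the $\chi^2$-distance [5]. We see that

$$D_{1/2}(P\|Q) = -2 \log(1 - H^2(P, Q)/2)$$

and $D_2(P\|Q) = \log(1 + \chi^2(P, Q))$. Hence by $\log x \le x - 1$

$$H^2(P, Q) \le D_{1/2}(P\|Q) \le D(P\|Q) \le D_2(P\|Q) \le \chi^2(P, Q).$$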
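
This chain of inequalities is easy to check numerically. The following sketch does so for one pair of strictly positive distributions on three points, using natural logarithms; the example values are ours.

```python
import math

p, q = [0.7, 0.2, 0.1], [0.3, 0.3, 0.4]

hellinger_sq = sum((math.sqrt(a) - math.sqrt(b))**2 for a, b in zip(p, q))
chi_sq = sum((a - b)**2 / b for a, b in zip(p, q))
d_half = -2 * math.log(sum(math.sqrt(a * b) for a, b in zip(p, q)))
kl = sum(a * math.log(a / b) for a, b in zip(p, q))
d_two = math.log(sum(a * a / b for a, b in zip(p, q)))

# H^2 <= D_{1/2} <= D <= D_2 <= chi^2
chain = [hellinger_sq, d_half, kl, d_two, chi_sq]
```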

III. Majorization, Markov ordering and Lorenz diagrams

The general theory of majorization is now a well established
mathematical discipline [7]. The majorization lattice and its
relation to discrete entropy was studied in [8] and later
generalized in [9]. Recently a long article on this subject by
Gorban, Gorban, and Judge has been accepted for publication
[10]. We refer to these papers for a more complete discussion
and further references. Here we shall relate the relative
majorization lattice to Renyi divergence.

Definition 2: Let P and Q be measures on the same measurable
set. The Lorenz diagram of (P, Q) is the range of

$$f \mapsto \left( \int f \, dP, \int f \, dQ \right)$$

where f is any measurable function with values in [0, 1].

If Q is the uniform distribution then the Lorenz diagram of
$(P_1, Q)$ is a subset of the Lorenz diagram of $(P_2, Q)$ if and
only if $P_2$ majorizes $P_1$.



Theorem 3: The Lorenz diagram of $(P_1, Q)$ is a subset of
the Lorenz diagram of $(P_2, Q)$ if and only if there exists a
Markov operator that transforms $P_2$ into $P_1$ and leaves Q
invariant.

Definition 4: Let $P_1$, $P_2$ and Q be measures on the same
measurable set X. We write $P_2 \succeq_Q P_1$ if the Lorenz diagram
of $(P_1, Q)$ is a subset of the Lorenz diagram of $(P_2, Q)$. If
the Lorenz diagrams of $(P_1, Q)$ and $(P_2, Q)$ are equal, then
we write $P_1 \sim_Q P_2$.

This ordering, which generalizes majorization, will be called
the Markov ordering [10].¹

Theorem 5 ([2]): Let Q be a measure on a measurable set
X. If Q is a uniform distribution on a finite set or if Q has no
atoms, then $M_+^1(X)/\!\sim_Q$ is a lattice, where $M_+^1(X)$ denotes
the set of probability measures on X.

The Lorenz diagram is characterized by a lower bounding curve
that is convex and an upper bounding curve that is concave.
Because of the symmetry around (1/2, 1/2) the Lorenz diagram
is completely determined by the lower bounding curve.

Definition 6: The Lorenz curve of (P, Q) is the convex envelope
of the Lorenz diagram, i.e. the largest convex function
such that all the points in the Lorenz diagram are at or above
the curve.

Proposition 7 ([9]): Let P and Q be measures on the same
measurable set X. The Lorenz curve of (P, Q) is the convex
envelope of the points $(P(A_t), Q(A_t))$ where $A_t$ are events
of the form $A_t = \{x \in X \mid \frac{dP}{dQ}(x) \le t\}$.

In statistics the sets $A_t = \{x \in X \mid \frac{dP}{dQ}(x) \le t\}$ play the role
of acceptance sets related to the likelihood ratio test of ratio
t. The proof of this proposition is therefore essentially the
same as the proof of the Neyman-Pearson Lemma [9]. Note
that for discrete measures there will only be finitely many
different points of the form $(P(A_t), Q(A_t))$, and in that case
the Lorenz curve is piecewise linear. For $t_1 < t_2$

$$\frac{P(A_{t_2}) - P(A_{t_1})}{Q(A_{t_2}) - Q(A_{t_1})} = \frac{P(\{t_1 < \frac{dP}{dQ} \le t_2\})}{Q(\{t_1 < \frac{dP}{dQ} \le t_2\})},$$

which lies between $t_1$ and $t_2$, so $(P(A_t), Q(A_t))$ gives a
parametrization of the Lorenz curve in terms of its slope if it
is differentiable.

Suppose Q is the counting measure on a finite set X of size
n, and let $P_1 = (v_1, \ldots, v_n)$ be a discrete measure on X. Then
$A_t$ is simply $\{i \mid v_i \le t\}$. Let $P_2 = (w_1, \ldots, w_n)$ be another
measure and let $B_t = \{i \mid w_i \le t\}$. Then $P_1 \preceq P_2$ if and only
if $P_1(A_{t_1}) \ge P_2(B_{t_2})$ whenever $Q(A_{t_1}) = Q(B_{t_2})$. Thus
$P_1 \preceq P_2$ if and only if the Lorenz curve of $(P_1, Q)$ is above
the Lorenz curve of $(P_2, Q)$.
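
In the discrete case the vertices of the Lorenz curve are obtained by sorting the points by likelihood ratio and accumulating mass. The sketch below is an illustration under our own conventions: vertices are returned as pairs (Q-mass, P-mass), the reference measure must be strictly positive, and the majorization test is written only for the counting-measure case described above.

```python
from itertools import accumulate

def lorenz_points(p, q):
    """Vertices (Q(A_t), P(A_t)) of the Lorenz curve for point masses p and q > 0.

    The sets A_t = {i : p_i / q_i <= t} are swept in order of increasing ratio.
    """
    order = sorted(range(len(p)), key=lambda i: p[i] / q[i])
    qs = [0.0] + list(accumulate(q[i] for i in order))
    ps = [0.0] + list(accumulate(p[i] for i in order))
    return list(zip(qs, ps))

def is_majorized_by(p1, p2, tol=1e-12):
    """True if the Lorenz curve of p1 lies above that of p2 (counting reference measure)."""
    counting = [1.0] * len(p1)
    c1 = [y for _, y in lorenz_points(p1, counting)]
    c2 = [y for _, y in lorenz_points(p2, counting)]
    # Both curves have vertices at the same abscissae 0, 1, ..., n,
    # so comparing them at the vertices is enough.
    return all(a >= b - tol for a, b in zip(c1, c2))

P1 = [0.25, 0.25, 0.5]
P2 = [0.1, 0.2, 0.7]
result = is_majorized_by(P1, P2)
```

Here the partial sums of the smallest entries are 0.25, 0.5, 1 for `P1` and 0.1, 0.3, 1 for `P2`, so `P2` majorizes `P1` but not conversely.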

If one of the conditions of Theorem 5 is fulfilled, then for
each convex function f there exists a measure P such that f is
the Lorenz curve of P. Thus $M_+^1(X)/\!\sim_Q$ can be identified
with the set of Lorenz curves. Let $P_1$ and $P_2$ be measures and
let $L_1$ and $L_2$ be their Lorenz curves. Then $P_1 \wedge P_2$ can be
identified with the Lorenz curve $\max\{L_1, L_2\}$ and $P_1 \vee P_2$ can
be identified with the Lorenz curve that is the convex envelope
of $\min\{L_1, L_2\}$. In general this lattice is neither modular nor
distributive [9].

¹ In [9] this ordering was called relative majorization.

Fig. 2. The meet and join of $P_1$ and $P_2$ have Lorenz diagrams that are the
intersection (dark gray) and the convex hull of their union (light gray).

IV. Divergence, Convexity and Ordering 

We will now consider properties of $D_\alpha(P\|Q)$ as we vary P
and Q while keeping $\alpha$ fixed. Information divergence $D(P\|Q)$
is known to be jointly convex in the pair (P, Q) [11]. By an
argument similar to the proof for $D_1$ in [11], this property
generalizes to $D_\alpha$ for arbitrary order $0 \le \alpha \le 1$:

Theorem 8: For $0 \le \alpha \le 1$, $D_\alpha(P\|Q)$ is jointly convex in
the pair (P, Q).

Even though joint convexity does not generalize to $\alpha > 1$,
we still have:

Theorem 9: For all $\alpha$, $D_\alpha(P\|Q)$ is convex in Q.

The key step in proving the latter result for $\alpha > 1$ relies on
Holder's inequality.

Let P be absolutely continuous with respect to Q. If F
denotes the curve that upper bounds the Lorenz diagram, then
the Renyi divergence is given by

$$D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log \int_0^1 (F'(t))^\alpha \, dt.$$

Note that we can replace the upper bounding function by the
lower bounding function (the Lorenz curve) without changing
the integral.
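
For discrete measures the bounding curve is piecewise linear, so the integral reduces to a finite sum over its segments. The sketch below evaluates the formula from the lower bounding curve and compares it with the direct definition; it assumes strictly positive point probabilities, takes the abscissa t to be the Q-mass, and uses function names of our own.

```python
import math
from itertools import accumulate

def renyi_from_lorenz(p, q, alpha):
    """Evaluate 1/(alpha-1) log int_0^1 F'(t)^alpha dt for the piecewise linear Lorenz curve."""
    order = sorted(range(len(p)), key=lambda i: p[i] / q[i])
    ts = [0.0] + list(accumulate(q[i] for i in order))  # Q-mass (abscissa)
    Fs = [0.0] + list(accumulate(p[i] for i in order))  # P-mass (ordinate)
    integral = 0.0
    for k in range(len(order)):
        dt = ts[k + 1] - ts[k]
        slope = (Fs[k + 1] - Fs[k]) / dt  # constant derivative on this segment
        integral += dt * slope**alpha
    return math.log(integral) / (alpha - 1)

def renyi_direct(p, q, alpha):
    return math.log(sum(a**alpha * b**(1 - alpha) for a, b in zip(p, q))) / (alpha - 1)

p, q = [0.7, 0.2, 0.1], [0.3, 0.3, 0.4]
pairs = [(renyi_from_lorenz(p, q, a), renyi_direct(p, q, a)) for a in (0.5, 2.0, 3.0)]
```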

Theorem 10: For $\alpha > 0$ the Renyi divergence $D_\alpha(P\|Q)$
is an increasing function of P on the lattice corresponding to
Q.

Proof: Let F and G be concave functions on [0, 1] such
that $F \le G$ and $F(0) = G(0) = 0$ and $F(1) = G(1) = 1$.
Let $x \mapsto \Phi_x$ be a Markov kernel such that $x = \int y \, d\Phi_x(y)$
for all $x \in [0, \infty[$. Then

$$\int G(y) \, d\Phi_x(y) \le G\left( \int y \, d\Phi_x(y) \right) = G(x). \qquad (5)$$

Consider the set of all Markov kernels $x \mapsto \Phi_x$ such that
$x = \int y \, d\Phi_x(y)$ and $F(x) \le \int G(y) \, d\Phi_x(y)$ for all
$x \in [0, \infty[$. This set is convex and contains an element such that
$F(x) = \int G(y) \, d\Phi_x(y)$. Then $F'(x) = \int G'(y) \, d\Phi_x(y)$
and the theorem follows from Jensen's inequality. ■

Theorem 10 is essentially a noisy data processing inequality
because the Markov kernel in the proof essentially maps
the measure corresponding to G into the measure corresponding
to F. By adapting a proof from [9] it is possible to prove
the following theorem:

Theorem 11: Let $P_1$ and $P_2$ denote distributions that are
absolutely continuous with respect to Q. If Markov ordering is
taken with respect to Q then power divergence is sub-modular
and super-additive, i.e.

$$d_\alpha(P_1, Q) + d_\alpha(P_2, Q) \ge d_\alpha(P_1 \wedge P_2, Q) + d_\alpha(P_1 \vee P_2, Q)$$

and

$$d_\alpha(P_1, Q) + d_\alpha(P_2, Q) \le d_\alpha(P_1 \wedge P_2, Q).$$

Since power divergence is a function of Renyi divergence
one can reformulate Theorem 11 in terms of Renyi divergence.
Like Renyi divergence, the power divergence $d_\alpha(P, Q)$ tends
to the information divergence $D(P\|Q)$ as $\alpha \uparrow 1$. This implies:

Corollary 12: Let $P_1$ and $P_2$ be distributions that are
absolutely continuous with respect to Q. If the Markov ordering
is taken with respect to Q then information divergence is
sub-modular and super-additive, i.e.

$$D(P_1\|Q) + D(P_2\|Q) \ge D(P_1 \wedge P_2\|Q) + D(P_1 \vee P_2\|Q),$$

$$D(P_1\|Q) + D(P_2\|Q) \le D(P_1 \wedge P_2\|Q).$$

V. Continuity of Renyi divergence

The type of continuity of $D_\alpha$ in the pair (P, Q) turns out to
depend on the topology and on $\alpha$. We consider the $\tau$-topology,
in which convergence of $P_n$ to P means that $P_n(A) \to P(A)$
for all $A \in \mathcal{F}$, and the total variation topology in which
$P_n \to P$ if the variation distance between $P_n$ and P goes to zero.
In general the total variation topology is stronger than the
$\tau$-topology, but if X is countable, then the two topologies
coincide.

Theorem 13: For any $\alpha > 0$, $D_\alpha(P\|Q)$ is a lower
semi-continuous function of (P, Q) in the $\tau$-topology.

Moreover:

Theorem 14: For $0 < \alpha < 1$, $D_\alpha(P\|Q)$ is a (uniformly)
continuous function of (P, Q) in the total variation topology.

It remains to consider $\alpha = 0$. In this case:

Corollary 15: $D_0(P\|Q)$ is an upper semi-continuous function
of (P, Q) in the total variation topology.

Using the Markov ordering we get more insight.

Theorem 16: If $\alpha > 1$ and $D_\alpha(\tilde{P}\|Q) < \infty$, then the
Renyi divergence $D_\alpha(P\|Q)$ is continuous in P on the set
$\{P \mid P \preceq_Q \tilde{P}\}$ when the set of probability measures is
equipped with the topology of total variation.

Proof: If $P_n \to P$ in total variation for $n \to \infty$ then
the Lorenz diagram of $P_n$ tends to the Lorenz diagram of
P in Hausdorff distance. Let $F$, $\tilde{F}$ and $F_n$ denote the upper
bounding functions for $P$, $\tilde{P}$ and $P_n$. Then for any $\varepsilon > 0$
eventually $F_n(t) \le \min\{\tilde{F}(t), F(t + \varepsilon)\}$ for all $t \in [0, 1]$.
Hence

$$\limsup_{n \to \infty} D_\alpha(P_n\|Q) \le [\ldots]$$

This holds for all $\varepsilon > 0$ and, since the right-hand side tends to
$\frac{1}{\alpha - 1} \log \int_0^1 (F'(t))^\alpha \, dt = D_\alpha(P\|Q)$ for $\varepsilon \to 0$, the result
follows. ■
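
The dependence of continuity on the order can be seen in a two-point example. In the sketch below, which uses example distributions and helper names of our own, $P_n = (1 - 1/n, 1/n)$ converges in total variation to $Q = (1, 0)$; the divergence of order 1/2 tends to $D_{1/2}(Q\|Q) = 0$, in line with Theorem 14, while the information divergence stays infinite, so for $\alpha = 1$ only lower semi-continuity can hold. The factor 1/2 in the variation distance is a normalization chosen here.

```python
import math

def d_half(p, q):
    """D_{1/2}(P||Q) = -2 log sum sqrt(p q)."""
    return -2 * math.log(sum(math.sqrt(a * b) for a, b in zip(p, q)))

def kl(p, q):
    """Information divergence with the conventions 0 log(0/x) = 0 and x log(x/0) = inf."""
    if any(a > 0 and b == 0 for a, b in zip(p, q)):
        return math.inf
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

Q = [1.0, 0.0]
sequence = [[1 - 1 / n, 1 / n] for n in (10, 100, 1000, 10000)]
tv = [total_variation(P, Q) for P in sequence]  # tends to 0
half = [d_half(P, Q) for P in sequence]         # tends to 0
one = [kl(P, Q) for P in sequence]              # infinite for every n
```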



VI. Guessing moments 

Erdal Arikan observed that the discrete version of Renyi
entropy is related to so-called guessing moments [1]. In this
short note we shall see that Renyi divergences are also related
to guessing moments.

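Definition 17: Let $P_1$ and $P_2$ denote probability measures
on X. We say that $P_1$ is a rearrangement of $P_2$ if

$$Q\left( \left\{ x \in X \mid \tfrac{dP_1}{dQ}(x) > t \right\} \right) = Q\left( \left\{ x \in X \mid \tfrac{dP_2}{dQ}(x) > t \right\} \right)$$

for all $t \in \mathbb{R}$.

Definition 18: A guessing function is a function g on
X such that $Q(\{x \mid g(x) \le t\}) \le t$ for $t \in [0, 1]$.

For a probability measure P on X with density $\frac{dP}{dQ}$ we are
interested in bounds on the moments of guessing functions.
For a guessing function g the $\rho$-th moment is given by

$$\|g\|_\rho = \left( \int (g(x))^\rho \, dP(x) \right)^{1/\rho}.$$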

Definition 19: Let P be a probability measure on X. For
each Radon-Nikodym derivative $\frac{dP}{dQ}$ the ranking function r of
$\frac{dP}{dQ}$ is given by

$$r(x) = Q\left( \left\{ y \mid \tfrac{dP}{dQ}(y) \ge \tfrac{dP}{dQ}(x) \right\} \right).$$

We note that if F is the distribution function of $\frac{dP}{dQ}$ then the
ranking function is given by $r(x) = 1 - F(x)$. The ranking
function is a guessing function.

$$Q(\{x \mid r(x) \le t\}) = [\ldots]$$

Note that $Q(\{x \mid r(x) \le t\}) = t$ for all $t \in [0, 1]$ if and only
if the distribution of the random variable $\frac{dP}{dQ}$ is continuous.

Proposition 20: The ranking function is the guessing function
that minimizes the $\rho$-th moment if $\rho > 0$ and maximizes
the $\rho$-th moment if $\rho < 0$.
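
When Q is uniform on n points, every ordering of the points yields a guessing function that assigns the value k/n to the k-th guess, and the ranking function corresponds to guessing in order of decreasing probability. The sketch below checks the minimizing property by brute force for a positive order only; it searches over these permutation-based guessing functions rather than over all guessing functions, and the helper names and example distribution are ours.

```python
from itertools import permutations

def ranking_function(p, q):
    """r(x) = Q{y : p(y)/q(y) >= p(x)/q(x)} for point masses with q > 0."""
    ratio = [a / b for a, b in zip(p, q)]
    n = len(p)
    return [sum(q[j] for j in range(n) if ratio[j] >= ratio[i]) for i in range(n)]

def guessing_moment(g, p, rho):
    """(sum_x g(x)^rho P(x))^(1/rho), the rho-th moment of g under P."""
    return sum(gx**rho * px for gx, px in zip(g, p)) ** (1 / rho)

n = 4
p = [0.4, 0.3, 0.2, 0.1]
q = [1 / n] * n
r = ranking_function(p, q)

# With Q uniform, g(x) = (position of x in the ordering) / n is a guessing function.
candidates = [[(k + 1) / n for k in perm] for perm in permutations(range(n))]
rho = 2.0
best = min(guessing_moment(g, p, rho) for g in candidates)
```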

Guessing and ranking are closely related to majorization
and the Markov ordering via the following proposition.

Proposition 21: Assume that $P_1$, $P_2$ and Q are probability
measures on X and $P_1 \preceq_Q P_2$. Let $r_1$ and $r_2$ denote the
ranking functions of $P_1$ and $P_2$. Then

$$\|r_1\|_\rho \le \|r_2\|_\rho \quad \text{if } \rho > 0,$$
$$\|r_1\|_\rho \ge \|r_2\|_\rho \quad \text{if } \rho < 0.$$

Lemma 22: If $\alpha = \frac{1}{1+\rho} > 0$ then, for any probability
measures P and Q,

$$-\log(\|r\|_\rho) \ge D_\alpha(P\|Q),$$

where the $\rho$-norm is calculated with respect to P, as in the
definition of the $\rho$-th moment, and r is the ranking function of $\frac{dP}{dQ}$.
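Proof: We have

$$r(x) = \int_{\left\{ \frac{dP}{dQ}(y) \ge \frac{dP}{dQ}(x) \right\}} 1 \, dQ(y) \le \int_{\left\{ \frac{dP}{dQ}(y) \ge \frac{dP}{dQ}(x) \right\}} \left( \frac{\frac{dP}{dQ}(y)}{\frac{dP}{dQ}(x)} \right)^{\alpha} dQ(y) \le \int \left( \frac{\frac{dP}{dQ}(y)}{\frac{dP}{dQ}(x)} \right)^{\alpha} dQ(y).$$

For $\rho > 0$ we get

$$\int r(x)^\rho \, dP(x) \le \left( \int \left( \tfrac{dP}{dQ}(y) \right)^{\alpha} dQ(y) \right)^{\rho} \int \left( \tfrac{dP}{dQ}(x) \right)^{-\alpha\rho} dP(x) = \left( \int \left( \tfrac{dP}{dQ} \right)^{\alpha} dQ \right)^{1+\rho},$$

where the last step uses $1 - \alpha\rho = \alpha$. For $\rho < 0$ the inequality
is reversed. We raise to the power $1/\rho$ and take the logarithm and
get

$$\log\left( E[r(X)^\rho]^{1/\rho} \right) \le \log\left( \left( \int \left( \tfrac{dP}{dQ}(x) \right)^{\alpha} dQ(x) \right)^{(1+\rho)/\rho} \right) = -D_\alpha(P\|Q). \qquad ■$$

Using additivity of Renyi divergence and Lemma 22 we get
the following theorem.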

Theorem 23: If $\alpha = \frac{1}{1+\rho} > 0$ then for any i.i.d. sequence
$X_1^n = (X_1, X_2, \ldots, X_n) \in X^n$ we have

$$-\frac{1}{n} \log(\|r(X_1^n)\|_\rho) \ge D_\alpha(P\|Q).$$

This bound is asymptotically tight as stated in the following
theorem.

Theorem 24: If $\alpha = \frac{1}{1+\rho} > 0$ then for any i.i.d. sequence
$X_1^n = (X_1, X_2, \ldots, X_n) \in X^n$ we have

$$\lim_{n \to \infty} -\frac{1}{n} \log(\|r(X_1^n)\|_\rho) = D_\alpha(P\|Q).$$

The result gives a new interpretation of Renyi divergence.
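
The bound of Lemma 22 and Theorem 23 can be checked numerically on a small alphabet. The sketch below makes several assumptions of our own: the orders are linked by $\alpha = 1/(1+\rho)$ with $\rho > 0$, the $\rho$-norm of the ranking function is taken under P, Q is uniform on three points, and the i.i.d. sequence is modeled by explicit product distributions. It only checks the inequality for a few block lengths and does not attempt to demonstrate the asymptotic statement of Theorem 24.

```python
import math
from itertools import product

def ranking_function(p, q):
    """r(x) = Q{y : p(y)/q(y) >= p(x)/q(x)} for point masses with q > 0."""
    ratio = [a / b for a, b in zip(p, q)]
    n = len(p)
    return [sum(q[j] for j in range(n) if ratio[j] >= ratio[i]) for i in range(n)]

def neg_log_guessing_norm(p, q, rho):
    """-log of the rho-norm (under P) of the ranking function."""
    r = ranking_function(p, q)
    moment = sum(rx**rho * px for rx, px in zip(r, p)) ** (1 / rho)
    return -math.log(moment)

def renyi(p, q, alpha):
    return math.log(sum(a**alpha * b**(1 - alpha) for a, b in zip(p, q))) / (alpha - 1)

def iid(p, n):
    """Point probabilities of the n-fold product of p."""
    return [math.prod(t) for t in product(p, repeat=n)]

p, q = [0.5, 0.3, 0.2], [1 / 3, 1 / 3, 1 / 3]
rho = 1.0
alpha = 1 / (1 + rho)
bound = renyi(p, q, alpha)
rates = [neg_log_guessing_norm(iid(p, n), iid(q, n), rho) / n for n in (1, 2, 4, 6)]
```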



VII. Discussion 

The results in this short paper are formulated under the
assumption that the second argument Q in $D_\alpha(P\|Q)$ is a
probability measure. Nevertheless many of the results still hold
if Q is a more general positive measure. For instance many
results on Renyi entropy are obtained when Q denotes the
counting measure. Most of these results for Renyi entropy
are well-known. Results for differential Renyi entropy are
obtained when Q is the Lebesgue measure. For both Renyi
entropy and differential Renyi entropy many results should
first be formulated and proved for subsets of finite measure
and then one should take a limit for an increasing sequence of
subsets. In this sense our results on Renyi divergence are often
more general than the results one will find in the literature.

We have related Renyi divergence to majorization and
Markov ordering. An interesting related concept is catalytic
majorization. It has been proved by M. Klimesh that one
discrete distribution majorizes another distribution if and only
if certain inequalities hold between their Renyi entropies [12].
A similar result is still to be proved for Renyi divergence.